ABSTRACT

As adaptive tutoring systems grow increasingly popular for the
completion of classwork and homework, it is crucial to assess the
manner in which students are scored within these platforms. The
majority of systems, including ASSISTments, return the binary
correctness of a student’s first attempt at solving each problem.
Yet for many teachers, partial credit is a valuable practice when
common wrong answers, especially in the presence of effort,
deserve acknowledgement. We present a grid search to analyze
441 partial credit models within ASSISTments in an attempt to
optimize per unit penalization weights for hints and attempts. For
each model, algorithmically determined partial credit scores are
used to bin problem performance, using partial credit to predict
binary correctness on the next question. An optimal range for
penalization is discussed and limitations are considered.
1. INTRODUCTION

Adaptive tutoring systems provide rich feedback and an
interactive learning environment in which students can excel,
while teachers maintain data-driven classrooms by using the
systems as powerful assessment tools. Simultaneously, these
platforms have opened the door for researchers conducting
minimally invasive educational research at scale while offering
new opportunities for student modeling. Still, they are commonly
restricted to measuring performance through binary correctness on
each problem. Arguably the most popular form of student
modeling within computerized learning environments, Knowledge
Tracing, is rooted in the binary correctness of each opportunity or
problem a student experiences within a given skill [1]. Knowledge
Tracing (KT) drives the mastery-learning component of renowned
tutoring systems including the Cognitive Tutor series, allowing
for real time predictions of student knowledge, skill mastery, or
next problem correctness [4]. Similar modeling methods consider
variables that extend beyond correctness but rarely escape the
binary nature of the construct, including Item Response Theory
[2] and Performance Factors Analysis [9]. By restricting input to a
binary metric across questions, these modeling techniques fail to
consider a continuous metric that is commonplace for many
teachers: partial credit.

Partial credit scoring used within adaptive tutoring systems
could provide more individualized prediction and thus establish
models with better fit. It is likely that binary correctness has
remained the default for learning models due to the inherent
difficulty of defining a universal algorithm to generalize partial
credit scoring across platforms. Some of the onus may also fall on
users’ familiarity with current system protocol; students tend to
avoid using system feedback regardless of the benefits it may
provide because requesting feedback results in score penalization.
However, the primary goal of these platforms is generally to
promote student learning rather than simply acting as an
assessment tool, and thus, binary correctness is flawed.

The present study considers data from ASSISTments, an
online adaptive tutoring system that provides assistance and
assessment to over 50,000 users around the world as a free service
of Worcester Polytechnic Institute. Researchers have previously
used ASSISTments data to modify student-modeling techniques
in a variety of ways including student level individualization [7],
item level individualization [8], and the sequence of student
response attempts [3]. Previous work has also shown that naive
algorithms and maximum likelihood tabling methods that consider
hints and attempts to predict next problem correctness can be
successful in establishing partial credit models meant to
supplement KT [10; 11]. More recently, algorithmically derived
partial credit scoring resulted in stand-alone tabled models using
data from only the most recent question and yet showing goodness
of fit measures on par with KT at lower processing costs [6].
However, we hypothesize that some conceptualizations of partial
credit may lead to better predictive models than others. Rather
than subjectively defining tables or algorithms, a data driven
approach should be considered. Thus, considering student
performance within the ASSISTments platform, the current study
employs a grid search on per unit penalizations of hints and
attempts to ask:

1. Based on penalties for hints and attempts dealt per unit, is it
possible to algorithmically define partial credit scoring that
optimizes the prediction of next problem correctness?

2. Does the optimal model of partial credit differ across
different granularities of dataset analysis?

Establishing an optimal partial credit metric within ASSISTments
would allow teachers using the tool to more accurately assess
student knowledge and learning, while allowing students to alter
their approach to system usage by taking advantage of adaptive
feedback. The optimization of partial credit scoring would also
enhance student modeling techniques and offer a new approach to
answering complex questions within the domain of educational
data mining.
2. DATA

The ASSISTments dataset used for the present study is comprised
solely of assignments known as Skill Builders. This type of
assignment requires students to correctly answer three consecutive
questions to complete the problem set. Questions are randomly
pulled from a large pool of skill content and are typically
presented with tutoring feedback, most commonly in the form of
hints. The dataset has been de-identified and is available at [5] for
further investigation.

The dataset used in the present study is a compilation of Skill
Builders from the 2012-2013 school year, containing data for
866,862 solved problems. Recorded data includes students’
performance on the problem (i.e., binary correctness, hint count,
attempt count), variables that identify the problem itself (i.e.,
problem type, unique problem identification number) and
information pertaining to the assignment housing the problem
(i.e., unique identifiers for assignments, skill type, teachers, and
schools). The dataset was representative of 120 unique skills and
24,912 unique problems, solved by 20,206 students.

On average, students made 1.53 attempts per problem (SD =
15.08). The minimum number of attempts was 0 (i.e., a student
who opened the problem and then left the tutor), while the
maximum number of attempts was a daunting 12,246 (i.e., a
student who hit ‘Enter’ repeatedly for a prolonged period of time,
likely out of frustration or boredom). Students made a total of
1,324,226 attempts across all problems. The majority of problems
(74.9%) had just one logged attempt per student (typically correct
answers), while 15.1% of problems carried only two logged
attempts.

Hint usage among all students averaged 0.61 hints per
problem (SD = 1.29). The minimum number of hints used was 0
(i.e., no feedback requested), while the maximum number of hints
used was 10. Interestingly, the maximum number of hints
available for any particular problem was 7. Thus, a handful of
students who logged more than 7 hints were accessing the tutor in
multiple browser windows (i.e., cheating). On average there were
3.22 hints available per problem (SD = 0.89). The majority of
problems contained 3 hints (44.6%), 4 hints (28.9%), or 2 hints
(18.2%). Although there were 2,768,299 hints available across all
problems, students only used 529,394 hints, or approximately
19% of available feedback. Bottom out hints, or those providing
the problem’s solution, were only used on 146,742 (16.9%) of
problems.

Additional analyses were performed on the 261,787 problems
that students answered incorrectly out of the original 866,862
problems solved. Within this subset of data, students made an
average of 2.75 attempts per problem (SD = 27.40). Students also
used an average of 2.02 hints (SD = 1.63). This subset of
problems had 860,131 total hints available, of which students used
528,644 hints (61.5%).

Hint usage would likely increase if partial credit scoring were
implemented within the ASSISTments platform. In many
classrooms, binary first attempt scoring has created an
environment in which students are afraid to use hints although
they would benefit from feedback, as they know they will receive
no credit. Further, the dataset suggests that once students are
marked wrong, they are more likely to jump through all available
hints and seek out the answer (56% of incorrect first attempts led
to bottom out hinting). This reflects another substantial downfall
in the system’s current protocol: once the risk has passed, so has
the drive to learn. The implementation of partial credit scoring has
the potential to alleviate this misuse.


3. METHODS 

The present study presents an extensive grid search of potential
per hint and per attempt penalizations. The full dataset was used
to define partial credit scores algorithmically based on per unit
penalizations ranging from 0 to 1 in increments of 0.05 for both
hints and attempts. Thus, for each solved problem in the dataset,
441 partial credit scores were established based on each possible
combination of per unit penalization. For example, in a model in
which each attempt earned a penalization of 0.05, and each hint
earned a penalization of 0.1, a student who made three attempts
and used one hint would receive a penalty of 0.25 ((3 x 0.05) +
(1 x 0.1)), effectively scoring 0.75 on that problem. This process
was used to score each problem in the dataset for each possible
penalty combination, with a floored per problem score of 0
(students could not receive negative scores). This method was
similar to that presented by Wang & Heffernan in the Assistance
Model [10] which established a tabling method to calculate
probabilities of next problem correctness based on combinations
of hints and attempts that resulted in twelve possible bins or
parameters.
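
The scoring rule and the 21 x 21 penalty grid described above can be
sketched as follows. This is a minimal illustration, not the authors'
code: the names partial_credit_score, PENALTIES, and GRID are
hypothetical, and the sketch assumes every logged attempt is penalized,
as in the worked example, since the handling of a correct first attempt
is not spelled out here.

```python
from itertools import product

def partial_credit_score(attempts, hints, attempt_penalty, hint_penalty):
    """Start from full credit, subtract per-unit penalties, floor at 0.

    Assumption: every logged attempt and hint is penalized, following the
    worked example in the text.
    """
    penalty = attempts * attempt_penalty + hints * hint_penalty
    # Round to two decimals so scores fall into clean bins despite
    # floating-point noise (penalties are multiples of 0.05).
    return max(0.0, round(1.0 - penalty, 2))

# Penalties from 0 to 1 in increments of 0.05: 21 values per dimension.
PENALTIES = [round(i * 0.05, 2) for i in range(21)]
# Each (attempt_penalty, hint_penalty) pair defines one partial credit model.
GRID = list(product(PENALTIES, PENALTIES))

print(len(GRID))                               # 441
print(partial_credit_score(3, 1, 0.05, 0.10))  # 0.75
```

Rounding the score is a design choice of this sketch: it keeps scores
that should be equal in the same bin when they are later used as table
keys.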

For each of the 441 partial credit models, a maximum
likelihood tabling method was employed using five fold cross
validation. Within each model, a modulo operation was used on
each student’s unique identification number to assign students to
one of five folds. Note that this method resulted in folds that all
represented approximately 20% of students in the dataset.
Maximum likelihood probabilities for next problem correctness
were then calculated for each partial credit score within each
model. Table 1 presents an average of test fold probabilities for
the model in which each attempt and each hint are penalized 0.1.
For instance, a student using two attempts (2 x 0.1) and one hint
(1 x 0.1) would be penalized 0.3, thus falling into the score bin of
0.7 (PC Score). Following through with this example, based on
11,174 problems solved that fit this scoring structure, the average
of known binary performance on the following problem was
0.599. This value becomes the prediction for next problem
correctness for students scoring 0.7 on the current problem.
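
One possible reading of the tabling procedure is sketched below: the
table is fit on four folds and applied to the held-out fold. The
function names and the record layout (student_id, score, next_correct)
are assumptions made for illustration, as is the fallback to the
training base rate when a score bin is absent from the training folds;
the text does not specify either.

```python
from collections import defaultdict

def fold_of(student_id, n_folds=5):
    """Assign a student to a fold via a modulo on the numeric ID."""
    return student_id % n_folds

def fit_table(rows):
    """rows: iterable of (score, next_correct) pairs.

    Returns {score: mean next-problem correctness}, i.e. the maximum
    likelihood estimate of P(next correct | score bin).
    """
    totals = defaultdict(lambda: [0, 0])  # score -> [sum correct, count]
    for score, next_correct in rows:
        totals[score][0] += next_correct
        totals[score][1] += 1
    return {score: s / n for score, (s, n) in totals.items()}

def cross_validated_predictions(records, n_folds=5):
    """records: list of (student_id, score, next_correct).

    Returns a list of (actual, predicted) pairs, one per held-out record.
    """
    out = []
    for k in range(n_folds):
        train = [(s, y) for sid, s, y in records if fold_of(sid, n_folds) != k]
        test = [(s, y) for sid, s, y in records if fold_of(sid, n_folds) == k]
        table = fit_table(train)
        # Assumption: unseen score bins fall back to the training base rate.
        base_rate = sum(y for _, y in train) / len(train) if train else 0.5
        for s, y in test:
            out.append((y, table.get(s, base_rate)))
    return out
```

Splitting by student rather than by row keeps all of a student's
problems in the same fold, which matches the modulo assignment on
student identifiers described above.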

Using the maximum likelihood probabilities for next problem
correctness within each test fold as predicted values, residuals
were then calculated by subtracting predictions directly from
actual next problem binary correctness (i.e., 1 - 0.725 = 0.275; 0 -
0.571 = -0.571). This approach was used rather than selecting an
arbitrary cutoff point to classify a prediction as correct or
incorrect in the binary sense (i.e., values greater than or equal to
0.6 serve as predictions of correctness) because it reduced the
potential for researcher bias.
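
The residual calculation can be sketched as below, together with the
root mean squared error that such residuals feed into. The function
names are hypothetical, and the two-row input simply reuses the two
example residuals from the text.

```python
import math

def residuals(actual, predicted):
    """Residual = actual binary correctness minus predicted probability."""
    return [a - p for a, p in zip(actual, predicted)]

def rmse(actual, predicted):
    """Root mean squared error over all residuals (no cutoff is applied)."""
    res = residuals(actual, predicted)
    return math.sqrt(sum(r * r for r in res) / len(res))

# The two examples from the text: a correct next answer predicted at 0.725
# and an incorrect next answer predicted at 0.571.
actual = [1, 0]
predicted = [0.725, 0.571]
print([round(r, 3) for r in residuals(actual, predicted)])  # [0.275, -0.571]
print(rmse(actual, predicted))
```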


Table 1. Probabilities averaged across test folds for the model
in which the penalization per hint and per attempt is 0.1

PC Score    n          Max. Likelihood NPC
0           149,504    0.467
0.1         422        0.571
0.2         685        0.581
0.3         1,055      0.578
0.4         1,784      0.574
0.5         3,442      0.583
0.6         6,623      0.585
0.7         11,174     0.599
0.8         18,679     0.662
0.9         49,972     0.725
1.0         476,523    0.802
4. RESULTS

For each model, residuals were used to calculate RMSE, R², and
AUC at three levels of granularity: problem level, student level,
and skill level. Heat maps are only presented here for RMSE, as
the other metrics established almost identical maps. Metrics
representing greater model fit are depicted using the purple end of
the spectrum, while those representing poorer fit are represented
using the red end of the spectrum. Further, a series of ANOVAs
was conducted to compare each set of models within the same
penalization level for attempts and hints. For example, the 21
models in which attempt penalty was set to 0.2 were compared to
all other sets of attempt penalty models to investigate significant
differences across penalties. This method was used rather than
comparing each model with all other models using paired samples
t-tests, as the resulting 194,481 analyses (441²) would greatly
inflate the rate of Type I error without unrealistic corrections.
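
The grouping behind these ANOVAs can be illustrated with a hand-rolled
one-way F statistic. This is a sketch under the assumption that each of
the 441 models contributes one fit value and that models are grouped by
penalty level into 21 groups of 21; the function name is hypothetical
and the synthetic input only demonstrates the degrees of freedom.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic with its degrees of freedom.

    groups: list of lists, one list of fit values per penalty level.
    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Synthetic stand-in for 21 penalty levels with 21 models each.
fake_groups = [[level + 0.01 * j for j in range(21)] for level in range(21)]
_, df_b, df_w = one_way_anova_f(fake_groups)
print(df_b, df_w)  # 20 420
```

With 441 observations in 21 groups, the degrees of freedom come out as
(20, 420).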

Initial analysis was performed at the problem level; residuals
were calculated for each problem that contained next problem
correctness metrics and goodness of fit measures were averaged
across the dataset. Each metric followed a similar structure in
which low attempt penalties appear to result in better fitting
models, while hint penalty does not appear to be significant. Thus,
partial credit scoring algorithms using lower penalties for attempts
were better at predicting next problem performance, as depicted in
Figure 1. The ANOVA results depicted in Table 2 suggest that
differences in attempt penalty models were significant. For
example, the set of models with per attempt penalties of 0.1
differed significantly from the set of models with per attempt
penalties of 0.8. Differences among hint penalty models were not
reliably significant. Figure 1 also suggests that the current binary
scoring protocol used by ASSISTments results in predictive models
that are inadequate. First attempt binary correctness is the
equivalent of the model in which per attempt and per hint penalty
are both set to 1 (the upper right corner of each heat map). This
model resulted in consistently poor fit metrics, suggesting that
modeling techniques such as KT should employ continuous or
binned partial credit values as input, as they enhance next problem
prediction ability. It has not yet been investigated how this
alteration would change the prediction of other variables
commonly predicted through KT, such as latent student
knowledge or skill mastery.

Student level analysis was undertaken using a subset of the
original data file. At this granularity, goodness of fit metrics were
calculated for each student and averaged across students to obtain
final metrics for each of the 441 models. As the ASSISTments
system measures completion of a Skill Builder as three
consecutive correct answers, a number of high performing
students had limited opportunity counts within skills. For students
with too few data points, it was not possible to calculate R² and
AUC. Therefore, student level analysis incorporated 7,429
students from the original dataset, or 651,849 problem logs.
Answering our second research question, it appears as though the
region of optimal partial credit values observed at the problem
level remains consistent at the student level, as shown in Figure 2.
ANOVA results depicted in Table 2 show reliably significant
differences across attempt penalty models but not across hint
penalty models.
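
Averaging a metric per student and then across students differs from
pooling all rows, because each student counts equally regardless of how
many problems they solved. A minimal sketch of that distinction for
RMSE follows; the function name and row layout are hypothetical, and
the same grouping would apply with skill identifiers in place of
student identifiers.

```python
import math
from collections import defaultdict

def mean_rmse_by_group(rows):
    """rows: iterable of (group_id, actual, predicted).

    Computes RMSE within each group (e.g. each student), then returns
    the unweighted mean of those per-group values.
    """
    squared = defaultdict(list)
    for group_id, actual, predicted in rows:
        squared[group_id].append((actual - predicted) ** 2)
    per_group = {g: math.sqrt(sum(v) / len(v)) for g, v in squared.items()}
    return sum(per_group.values()) / len(per_group)

rows = [
    ("s1", 1, 0.5), ("s1", 1, 0.5), ("s1", 1, 0.5),  # three logs for one student
    ("s2", 0, 1.0),                                   # one log for another
]
print(mean_rmse_by_group(rows))  # 0.75
```

In this toy input the pooled RMSE over all four rows would be lower than
0.75, since the student with more logs dominates the pooled estimate.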

Figure 1. Problem Level RMSE

Table 2. ANOVA results for groups of attempt and hint
penalty models at each level of analysis

                                 Attempt Penalty         Hint Penalty
Level    Metric  Min   Max     F       p     η²       F     p     η²
Problem  RMSE    .430  .435    302.70  .000  .935     0.95  .519  .043
         AUC     .626  .655    295.46  .000  .934     1.14  .304  .052
         R²      .070  .091    304.34  .000  .935     0.95  .525  .043
Student  RMSE    .424  .429    222.49  .000  .914     1.34  .149  .060
         AUC     .578  .593    208.19  .000  .908     1.42  .106  .063
         R²      .096  .110    374.52  .000  .947     0.80  .715  .037
Skill    RMSE    .423  .429    517.85  .000  .961     0.55  .944  .026
         AUC     .624  .647    250.17  .000  .923     0.72  .805  .033
         R²      .073  .090    510.96  .000  .961     0.49  .971  .023

Note. For all models, df = (20, 420).

Figure 2. Student Level RMSE

Figure 3. Skill Level RMSE

Skill level analysis was also undertaken using a subset of the
original data file. One skill did not have enough data based on a
low number of users and high mastery within those users, and was
excluded from skill level analysis, resulting in a file with 119
skills. At this granularity, goodness of fit metrics were calculated
for each skill and averaged across all skills to obtain final metrics
for each of the 441 models. Results are depicted in Figure 3. The
heat map shows that the region of optimal penalization has grown
more concise, showing optimal fit among models with low per
hint and per attempt penalties (< 0.3). ANOVA results depicted in
Table 2 again suggest reliably significant differences in all metrics
across attempt penalty models but not across hint penalty models.

Post-hoc analyses were conducted on ANOVA results using
multiple comparisons to examine significant differences between
attempt penalty and hint penalty model groups when considering
problem level AUC. Using a Bonferroni correction to reduce Type
I error, this process resulted in a series of significance estimates
for penalty group comparisons (e.g., all models where attempt
penalty is 0.1 compared to all models where attempt penalty is 0.3
results in a non-significant difference, p = 0.88). Results
suggested that models close in penalty were less likely to differ
significantly than models with greater difference in penalty. For
instance, models with an attempt penalty of 0.1 were significantly
different from those with an attempt penalty of 0.4, but were not
significantly different from those with an attempt penalty of 0.2.
This information can be used to help optimize partial credit
penalizations, as it may be more motivating and productive for
students to receive smaller penalizations. Such information could
also allow systems like ASSISTments to define a range of
possible penalizations that could then be refined by the teacher,
providing all users with a greater sense of control.

5. DISCUSSION & CONTRIBUTION 

The initial findings of a grid search on partial credit penalization 
through per unit hint and attempt docking suggest that the 
implementation of partial credit within adaptive tutoring systems 
can be established using a data driven approach that will 
ultimately produce stronger predictive models of student 
performance while enhancing the way adaptive tutoring systems 
are used by students and teachers. 

Our first research question was answered with a resounding
“Yes”: certain algorithmically derived combinations of partial
credit penalization are better than others when used to predict next
problem performance. Optimal partial credit models were visible
in heat maps spanning three levels of data granularity and
remained relatively consistent across granularities, thus answering
our second research question. ANOVAs revealed that differences
in attempt penalty models were consistently significant across
dataset granularities, while differences in hint penalty models
were not reliable. This finding is likely due to the fact that hint
usage is lower and less distributed than attempt count across
problems in the dataset, and it is possible that this finding would
diminish in a system that more readily promoted the use of
tutoring feedback without penalization, or a system already
employing partial credit scoring.

The partial credit models that we define here as optimal, based
on their ability to predict next problem performance, were models
with per hint and per attempt penalties of 0.3 or less. Additional
analyses revealed that at the problem level, there was no
reliable difference in predictive ability between a model penalizing
0.3 per attempt and a model penalizing 0.1 per attempt, with
variable hint penalization. This finding suggests that less
penalization is just as effective, offering an opportunity to consider
student motivation and affect when defining a partial credit
algorithm. This grid search also revealed that partial credit metrics
outperform binary metrics when predicting next problem
performance, as previously shown in [6]. Thus, it is possible to
improve prediction of student performance within adaptive
tutoring systems simply by implementing partial credit scoring. It
should also be noted that a leading limitation of the approach
presented here is that we have only been predicting next problem
correctness, rather than latent variables such as skill mastery or
student knowledge. It is possible that optimizing partial credit
would also provide benefits for the prediction of latent effects, but
further research is necessary in this domain.